A Bigram Extension to Word Vector Representation
Authors
Abstract
GloVe is an algorithm that associates a vector with each word such that the dot product of two word vectors corresponds to the likelihood that the words appear together in a large corpus ([PSM14]). GloVe vectors achieve state-of-the-art performance on word analogy tasks (v(king) − v(man) + v(woman) ≈ v(queen)), but they are limited to capturing the meanings of individual words. In our project, we develop “biGloVe,” a version of GloVe that learns vector representations of bigrams. Using the full English Wikipedia text as our training corpus, we compute 1.2 million bigram vectors in 150 dimensions. To evaluate the quality of our biGloVe vectors, we apply them to two machine learning tasks. The first task is a 2012 SemEval challenge in which one must determine the semantic similarity of two sentences or phrases. We trained a logistic regression model using the cosine similarity of the averaged sentence (bi)GloVe vectors as features, and found slightly better performance on one challenge when GloVe and biGloVe were combined; in general, however, biGloVe vectors did not improve performance. Second, we applied biGloVe vectors to classifying the sentiment of movie reviews, training naive Bayes (with bag-of-words features), SVM, and random forest classifiers. We found that naive Bayes or an SVM with GloVe vectors performed best. Applications of biGloVe vectors were hindered by insufficient bigram coverage, despite training 1.2 million vectors. At the same time, examination of nearest neighbors revealed that biGloVe vectors do capture semantic relationships unique to bigrams, suggesting that the method has promise. Training new vectors on a much larger corpus such as Common Crawl is likely to improve the performance of biGloVe vectors on these tasks.
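A minimal sketch of the sentence-similarity feature described above: each sentence is represented by the average of its (bi)GloVe vectors, and the cosine similarity of the two averages is fed to the classifier. The `glove` lookup table, the dimensionality handling, and the helper names are illustrative assumptions, not the paper's published pipeline.

```python
import numpy as np

def sentence_vector(tokens, glove, dim=150):
    """Average the vectors of all in-vocabulary tokens (zeros if none)."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_feature(sent_a, sent_b, glove, dim=150):
    """Cosine similarity of the two averaged sentence vectors."""
    u = sentence_vector(sent_a, glove, dim)
    v = sentence_vector(sent_b, glove, dim)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical usage: one such feature per vector model, stacked as the
# input to sklearn.linear_model.LogisticRegression:
# X = [[cosine_feature(a, b, glove),
#       cosine_feature(bigrams(a), bigrams(b), biglove)]
#      for a, b in sentence_pairs]
```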
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text motivated researchers to use...
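For context, a minimal sketch of the TF-IDF document vectors this snippet refers to, using scikit-learn's TfidfVectorizer; the toy corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # one TF-IDF feature vector per document
```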
Word Pairs in Language Modeling for Information Retrieval
Previous language modeling approaches to information retrieval have focused primarily on single terms. The use of bigram models has been studied, but the restriction on word order and adjacency may not be justified for information retrieval. We propose a new language modeling approach to information retrieval that incorporates lexical affinities, or pairs of words that occur near each other, wi...
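A minimal sketch of extracting such lexical affinities, assuming "near each other" means unordered pairs co-occurring within a small window; the window size and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def lexical_affinities(tokens, window=5):
    """Count unordered token pairs co-occurring within `window` positions,
    relaxing the order and adjacency restrictions of ordinary bigrams."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((w, v)))] += 1
    return pairs

text = "the quick brown fox jumps over the lazy dog"
print(lexical_affinities(text.split()).most_common(3))
```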
Interpolated Distanced Bigram Language Models for Robust Word Clustering
Two methods for interpolating the distanced bigram language model are examined which take into account pairs of words that appear at varying distances within a context. The language models under study yield a lower perplexity than the baseline bigram model. A word clustering algorithm based on mutual information with robust estimates of the mean vector and the covariance matrix is employed in t...
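A minimal sketch of interpolating distanced bigram models, assuming per-distance conditional estimates P(w_i | w_{i−d}) mixed with fixed weights; the add-one smoothing and the weight handling are illustrative assumptions, not the paper's estimator.

```python
from collections import Counter

def train_distanced_bigrams(tokens, max_distance=3):
    """Count (history, word) pairs separately for each distance d."""
    counts = {d: Counter() for d in range(1, max_distance + 1)}
    history = {d: Counter() for d in range(1, max_distance + 1)}
    for i, w in enumerate(tokens):
        for d in range(1, max_distance + 1):
            if i - d >= 0:
                h = tokens[i - d]
                counts[d][(h, w)] += 1
                history[d][h] += 1
    return counts, history

def interpolated_prob(w, context, counts, history, weights, vocab_size):
    """Mix add-one-smoothed P(w | w_{i-d}) over distances d = 1..len(weights)."""
    p = 0.0
    for d, lam in enumerate(weights, start=1):
        if d <= len(context):
            h = context[-d]
            p += lam * (counts[d][(h, w)] + 1) / (history[d][h] + vocab_size)
    return p
```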
New language models using phrase structures extracted from parse trees
This paper proposes a new speech recognition scheme using three linguistic constraints. Multi-class composite bigram models [1] are used in the first and second passes to reflect word-neighboring characteristics as an extension of conventional word n-gram models. Trigram models with constituent boundary markers and word pattern models are both used in the third pass to utilize phrasal constrain...
Semantic Composition and Decomposition: From Recognition to Generation
Semantic composition is the task of understanding the meaning of text by composing the meanings of the individual words in the text. Semantic decomposition is the task of understanding the meaning of an individual word by decomposing it into various aspects (factors, constituents, components) that are latent in the meaning of the word. We take a distributional approach to semantics, in which a ...
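A minimal sketch of the simplest distributional composition operator, additive composition, assuming a word-vector lookup table; the paper itself studies richer composition and decomposition operators.

```python
import numpy as np

def compose(words, vectors):
    """Additive composition: unit-normalized sum of the word vectors."""
    v = np.sum([vectors[w] for w in words], axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# e.g. compose(["red", "car"], vectors) should lie near the vectors of
# semantically related words such as "automobile".
```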